1. Objective

We aim to explore students’ behavior in the beginning-of-term (BoT) survey from the BA200 Fall 2019 class. In other words, we want to know whether there are distinct types of students and, if so, whether those types relate to outcomes in any way. This can also inform how we should speak to students, not just predict how they will do.

2. Introduction

Clustering is a statistical technique for finding meaningful groups of objects such that the objects within a group are similar or related to one another and dissimilar to the objects in other groups.

By conducting a clustering analysis, we can better understand our BoT data and see whether students relate to each other in certain ways.
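As a minimal illustration of the idea (using simulated data, not the survey), k-means should recover two well-separated groups:

```r
# Minimal illustration with simulated data (not the survey data):
# two well-separated point clouds that k-means should recover.
set.seed(1)
x = rbind(matrix(rnorm(100, mean = 0), ncol = 2),
          matrix(rnorm(100, mean = 5), ncol = 2))
km = kmeans(x, centers = 2, nstart = 20)
table(km$cluster, rep(1:2, each = 50))  # rows: found clusters, cols: true groups
```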

3. Exploratory Data Analysis

3.1 BoT Survey

library(tidyverse)
theme_set(theme_bw())
library(ggpubr)
library(psych)

file = read.csv("./data/BA200-Fall2019_deidentified.csv")
# Race/ethnicity indicator variables removed from the analysis
var_rm = c("STDNT_ASIAN_IND", "STDNT_BLACK_IND", "STDNT_HWIAN_IND", 
           "STDNT_HSPNC_IND", "STDNT_NTV_AMRCN_IND", "STDNT_WHITE_IND", 
           "STDNT_MULTI_ETHNC_IND", "STDNT_HSPNC_LATINO_IND")
data = file[,-c(3:8, 11, 12)]  # drop unused columns by position
# Give international students their own level (2) in the minority codes
data[which(data$STDNT_DMSTC_MNRTY_DES == "International"),
     "STDNT_DMSTC_MNRTY_CD"] = 2
data[which(data$STDNT_DMSTC_UNDREP_MNRTY_DES == "International"),
     "STDNT_DMSTC_UNDREP_MNRTY_CD"] = 2
# Drop invalid (-1) and missing responses
data = data %>% filter(BOT_Extraversion != -1) %>% drop_na()

demo = data[,c(1:10)]
BOT = data[, c(1, grep("BOT", names(data)))]
MT = data[, c(1, grep("MT", names(data)))]
EOT = data[, c(1, grep("EOT", names(data)))]
BOT_clu = BOT[,c(5:10, 18:20)]  # the nine continuous BoT items used for clustering
describe(BOT_clu)
##                       vars   n mean   sd median trimmed  mad min max range
## BOT_Extraversion         1 551 4.66 1.46      5    4.76 1.48   1   7     6
## BOT_Procrastination      2 551 4.09 1.50      4    4.05 1.48   1   7     6
## BOT_Belongingness        3 551 3.21 1.36      3    3.17 1.48   1   7     6
## BOT_Control              4 551 3.86 1.45      4    3.85 1.48   1   7     6
## BOT_SpeakUp              5 551 4.46 1.30      5    4.51 1.48   1   7     6
## BOT_StretchGrade         6 551 4.09 1.41      4    4.10 1.48   1   7     6
## BOT_PastPositive         7 551 3.91 0.67      4    3.94 0.00   1   5     4
## BOT_PastDiverse          8 551 3.96 0.63      4    3.98 0.00   1   5     4
## BOT_PastWorkDifferent    9 551 2.92 0.78      3    2.90 1.48   1   5     4
##                        skew kurtosis   se
## BOT_Extraversion      -0.44    -0.91 0.06
## BOT_Procrastination    0.19    -1.07 0.06
## BOT_Belongingness      0.33    -0.66 0.06
## BOT_Control           -0.05    -0.92 0.06
## BOT_SpeakUp           -0.33    -0.58 0.06
## BOT_StretchGrade       0.01    -0.31 0.06
## BOT_PastPositive      -1.02     2.61 0.03
## BOT_PastDiverse       -0.78     2.65 0.03
## BOT_PastWorkDifferent  0.14    -0.27 0.03

BOT_PastPositive and BOT_PastDiverse are moderately left-skewed, while the other variables are fairly symmetric. Most variables have negative (excess) kurtosis, indicating light-tailed distributions; BOT_PastPositive and BOT_PastDiverse are the exceptions, with positive kurtosis.

b_out = outlier(BOT_clu, cex = .8, plot = F)
pairs.panels(BOT_clu, bg=c("yellow","blue")[(b_out > 30)+1], pch=21,
             main = "Pair Plots of BOT Scale Data")
**Figure 1.** The matrix shows the Pearson correlations, histograms, and locally smoothed regressions. Each ellipse is centered at the mean, with axis lengths reflecting one standard deviation of the x and y variables. Blue points are potential outliers.


BOT_Extraversion shows a positive correlation (r = 0.5) with BOT_SpeakUp and a negative correlation (r = -0.26) with BOT_Belongingness. The other variables show little correlation with each other.

3.2 EoT Survey

EOT_clu = EOT[,c(2:14, 18:26)]
describe(EOT_clu)
##                     vars   n mean   sd median trimmed  mad  min max range  skew
## EOT_SelfIdeas          1 551 7.39 1.00   7.00    7.42 1.48 3.00   9  6.00 -0.56
## EOT_SelfTeacher        2 551 7.38 1.15   7.00    7.46 1.48 2.00   9  7.00 -0.83
## EOT_SelfListener       3 551 7.23 1.82   8.00    7.54 1.48 1.00   9  8.00 -1.40
## EOT_SelfEnacted        4 551 7.52 1.09   8.00    7.60 1.48 1.00   9  8.00 -1.22
## EOT_SelfEffort         5 551 7.65 1.06   8.00    7.75 1.48 3.00   9  6.00 -0.81
## EOT_SelfQuality        6 551 7.61 1.02   8.00    7.69 1.48 2.00   9  7.00 -1.10
## EOT_SelfBelonging      7 551 8.03 1.22   8.00    8.24 1.48 1.00   9  8.00 -2.10
## EOT_SelfReliability    8 551 8.16 1.03   8.00    8.31 1.48 3.00   9  6.00 -1.59
## EOT_SelfValuable       9 551 7.63 1.09   8.00    7.75 1.48 1.00   9  8.00 -1.26
## EOT_Extraversion      10 551 5.31 1.57   6.00    5.49 1.48 1.00   7  6.00 -0.84
## EOT_Procrastination   11 551 4.43 1.75   5.00    4.45 1.48 1.00   7  6.00 -0.25
## EOT_Control           12 551 5.03 1.69   5.00    5.18 1.48 1.00   7  6.00 -0.59
## EOT_SpeakUp           13 551 5.13 1.40   5.00    5.23 1.48 1.00   7  6.00 -0.59
## EOT_PeerBelonging     14 551 7.90 1.05   8.25    8.07 0.62 2.00   9  7.00 -2.00
## EOT_PeerEffort        15 551 7.22 1.14   7.50    7.34 1.11 1.50   9  7.50 -1.27
## EOT_PeerEnacted       16 551 7.16 1.08   7.33    7.27 0.99 2.75   9  6.25 -1.11
## EOT_PeerIdeas         17 551 7.05 1.10   7.25    7.15 1.11 1.75   9  7.25 -1.00
## EOT_PeerListener      18 551 7.16 1.24   7.33    7.29 0.99 2.00   9  7.00 -1.06
## EOT_PeerQuality       19 551 7.31 1.07   7.50    7.42 0.74 2.75   9  6.25 -1.05
## EOT_PeerReliability   20 551 7.87 1.05   8.00    8.04 0.74 3.00   9  6.00 -1.85
## EOT_PeerTeacher       21 551 7.16 1.00   7.25    7.23 1.11 2.75   9  6.25 -0.82
## EOT_PeerValuable      22 551 7.63 0.91   7.75    7.73 0.74 2.50   9  6.50 -1.34
##                     kurtosis   se
## EOT_SelfIdeas           0.95 0.04
## EOT_SelfTeacher         1.26 0.05
## EOT_SelfListener        1.41 0.08
## EOT_SelfEnacted         4.07 0.05
## EOT_SelfEffort          0.81 0.05
## EOT_SelfQuality         3.09 0.04
## EOT_SelfBelonging       6.28 0.05
## EOT_SelfReliability     3.57 0.04
## EOT_SelfValuable        3.63 0.05
## EOT_Extraversion       -0.36 0.07
## EOT_Procrastination    -1.17 0.07
## EOT_Control            -0.83 0.07
## EOT_SpeakUp            -0.31 0.06
## EOT_PeerBelonging       5.00 0.04
## EOT_PeerEffort          2.54 0.05
## EOT_PeerEnacted         1.58 0.05
## EOT_PeerIdeas           1.36 0.05
## EOT_PeerListener        1.24 0.05
## EOT_PeerQuality         1.28 0.05
## EOT_PeerReliability     4.06 0.04
## EOT_PeerTeacher         1.16 0.04
## EOT_PeerValuable        3.01 0.04

The variables are generally left-skewed. Several, including EOT_SelfBelonging, EOT_PeerBelonging, and EOT_PeerReliability, also have large positive kurtosis, indicating heavy tails.

e_out = outlier(EOT_clu, cex = .8, plot = F)
pairs.panels(EOT_clu[,c(1:13)], bg=c("yellow","blue")[(e_out > 80)+1], 
             pch=21, main = "Pair Plots of EOT Scale Data")
**Figure 2.** The first 13 variables from the EOT data.


Several pairs of variables are correlated (r > 0.6): EOT_SelfIdeas and EOT_Extraversion, EOT_SelfEnacted and EOT_SelfQuality, EOT_SelfEnacted and EOT_SelfValuable, and EOT_SelfQuality and EOT_SelfValuable.

pairs.panels(EOT_clu[,-c(1:13)], bg=c("yellow","blue")[(e_out > 80)+1], 
             pch=21, main = "Pair Plots of EOT Scale Data")
**Figure 3.** The remaining 9 variables from the EOT data.


EOT_PeerEffort, EOT_PeerEnacted, EOT_PeerIdeas, EOT_PeerQuality, EOT_PeerTeacher, and EOT_PeerValuable are highly correlated.

4. Clustering Techniques

4.1 K-means Clustering

We apply the standard K-means algorithm to the continuous variables extracted from the BoT survey. The variables are BOT_Extraversion, BOT_Procrastination, BOT_Belongingness, BOT_SpeakUp, BOT_Control, BOT_StretchGrade, BOT_PastPositive, BOT_PastDiverse, and BOT_PastWorkDifferent.

Since k-means suffers from the initial-centroid problem (randomly selected initial centroids can cause the resulting clusters to vary considerably), we run the algorithm 20 times under Euclidean distance and keep the partition with the smallest total within-cluster sum of squares. K-means is computationally cheap, but its limitations include sensitivity to outliers and difficulty with clusters of differing sizes, densities, and non-spherical shapes.
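The initial-centroid problem can be sketched on simulated noise (not the survey data): single-start runs of `kmeans()` land in different local optima, and `nstart = n` simply keeps the best of n such starts.

```r
# Sketch of the initial-centroid problem on simulated noise (no real
# cluster structure, so many local optima exist). Each single-start run
# may end in a different optimum; nstart = n keeps the best of n starts.
set.seed(42)
y = matrix(rnorm(300), ncol = 3)
wss = replicate(20, kmeans(y, centers = 4, nstart = 1)$tot.withinss)
range(wss)  # single random starts can differ noticeably
min(wss)    # what kmeans(y, 4, nstart = 20) effectively returns
```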

# K-means: ---------------------------------------------------------------------
set.seed(123)
k_max = 8
wss = sapply(1:k_max, function(k) kmeans(BOT_clu, k, nstart = 20)$tot.withinss)
plot(1:k_max, wss, type="b", pch = 19, xlab = "Number of clusters K",
     ylab = "Total within-clusters sum of squares")

The elbow plot suggests that k = 4 or k = 5 clusters is reasonable, since the within-cluster variance flattens out there. Hence, we try both values of k; the sample size of each cluster is shown below.

cluster      1    2    3    4    5
k = 4      131  176  119  125
k = 5      120  110  104  102  115
# k = 4
set.seed(123)
km4 = kmeans(BOT_clu, 4, nstart = 20)
#table(km4$cluster)

# k = 5
set.seed(123)
km5 = kmeans(BOT_clu, 5, nstart = 20)
#table(km5$cluster)

4.2 Hierarchical Clustering

Hierarchical clustering is another common approach in unsupervised learning. We use the agglomerative (bottom-up) variant, which differs from partitional methods like K-means. There are various ways to define inter-cluster dissimilarity; the three most widely used are complete linkage, single linkage, and average linkage. Complete linkage is robust to outliers while single linkage is not. On the other hand, single linkage can handle clusters of diverse sizes, whereas complete linkage tends to break large clusters. Average linkage falls somewhere in between. All three measures, however, have a preference for spherical clusters.

hc_complete = hclust(dist(BOT_clu), method = "complete")
hc_average = hclust(dist(BOT_clu), method = "average")
hc_single = hclust(dist(BOT_clu), method = "single")

par(mfrow = c(1,3))
plot(hc_complete, main = "Complete Linkage", xlab = "", sub = "", cex = .9)
plot(hc_average, main = "Average Linkage", xlab = "", sub = "", cex = .9)
plot(hc_single, main = "Single Linkage", xlab = "", sub = "", cex = .9)

4.3 DBSCAN

K-means clustering and hierarchical clustering are suitable for finding spherical, compact, well-separated clusters. However, they are also relatively sensitive to noise and outliers in the data.

DBSCAN, a density-based algorithm introduced by Ester et al. (1996), has the advantage of handling arbitrarily shaped clusters and being robust to outliers. The idea behind DBSCAN is that for points within a cluster, the distances to their kth nearest neighbors are roughly the same. Two required parameters are epsilon, the radius of the neighborhood around a point, and the minimum points (minPts), the minimum number of neighbors within the epsilon radius. The plot below helps us determine a suitable epsilon value.

library(dbscan)
# Determine the optimal eps value
par(mfrow = c(1,1))
kNNdistplot(BOT_clu, k = 10)
abline(h = 3, col = "red", lty = 2)
abline(h = 4, col = "blue", lty = 2)

dbs = dbscan(BOT_clu, eps = 3, minPts = 5)
dbs
## DBSCAN clustering for 551 objects.
## Parameters: eps = 3, minPts = 5
## The clustering contains 1 cluster(s) and 6 noise points.
## 
##   0   1 
##   6 545 
## 
## Available fields: cluster, eps, minPts

5. Cluster Validity

5.1 Numerical Assessment

To evaluate the quality of the clustering results, we use the silhouette coefficient.

The silhouette coefficient for point \(i\) is defined by \[s_i = \frac{b_i - a_i}{\max(a_i, b_i)},\] which simplifies to \(1 - a_i/b_i\) when \(a_i \le b_i\). Here \(a_i\) is the average distance from object \(i\) to the other objects in its own cluster, and \(b_i = \min_k d(i,k)\), where \(d(i,k)\) is the average distance from object \(i\) to all objects in cluster \(k\), minimized over the clusters \(k\) that do not contain \(i\).

It measures how close each point in one cluster is to the points in the neighboring clusters. The silhouette coefficient ranges from -1 to 1. A value close to 1 indicates that the sample is far from the neighboring clusters. A value close to 0 indicates ambiguity, since the points, on average, lie very close to the decision boundary between two neighboring clusters. A value close to -1 indicates poor clustering, since those samples may have been assigned to the wrong cluster.
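To make the definition concrete, here is a hand-rolled silhouette computation on simulated data, checked against `cluster::silhouette()`. The helper `sil_point` is defined here purely for illustration; the data and cluster labels are simulated, not from the survey.

```r
library(cluster)

# Hand-rolled silhouette, checked against cluster::silhouette() on
# simulated data. a_i: mean distance to points in i's own cluster;
# b_i: smallest mean distance to any other cluster (only one other here).
set.seed(123)
z = rbind(matrix(rnorm(40, mean = 0), ncol = 2),
          matrix(rnorm(40, mean = 4), ncol = 2))
cl = kmeans(z, 2, nstart = 20)$cluster
D = as.matrix(dist(z))

sil_point = function(i) {
  own = setdiff(which(cl == cl[i]), i)  # same-cluster points, excluding i
  a = mean(D[i, own])
  b = mean(D[i, cl != cl[i]])
  (b - a) / max(a, b)
}

manual = sapply(seq_len(nrow(z)), sil_point)
packaged = silhouette(cl, dist(z))[, "sil_width"]
all.equal(manual, as.numeric(packaged))  # TRUE
```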

library(cluster)
# k-means
# unscaled
set.seed(123)
k = 3
km = kmeans(BOT_clu, k, nstart = 20)
diss = daisy(BOT_clu)
sil = silhouette(km$cluster, diss)
#summary(sil, FUN = mean)
# k = 3 - 0.15, k = 4 - 0.14, k = 5 - 0.13

# scaled
set.seed(123)
k = 3
km = kmeans(scale(BOT_clu), k, nstart = 20)
diss = daisy(BOT_clu)  # dissimilarities kept on the unscaled data for comparability
sil = silhouette(km$cluster, diss)
# k = 3 - 0.14, k = 4 - 0.11, k = 5 - 0.09

# HC
#summary(silhouette(cutree(hc_complete, 3), diss))
# DBSCAN
#summary(silhouette(dbscan(BOT_clu, eps = 3, minPts = 5)$cluster, diss), col = "blue")

The silhouette coefficients for 6 different methods:

K / eps   K-means   K-means (scaled)   HC (complete)   HC (single)   HC (average)   DBSCAN
3           0.15        0.14              0.03           0.25           0.22          0.26
4           0.14        0.11              0.04           0.16           0.18          0.33
5           0.13        0.09              0.07           0.08           0.14           NA

The table above shows that DBSCAN with eps = 4 has the highest silhouette coefficient (0.33). We also plot the silhouette of K-means with k = 3 to illustrate.

plot(silhouette(km$cluster, diss), col=c("blue", "red", "orange"), border=NA,
     main = "Silhouette plot of K-means with k = 3")

5.2 Visual Assessment

In addition to the numerical assessment, we provide boxplots to evaluate the clusters. Since the silhouette coefficients are generally higher with 3 clusters, we look at the boxplots for each method with the number of clusters set to 3.

# k-means
km3 = kmeans(BOT_clu, 3, nstart = 20)$cluster
k3 = BOT_clu %>% 
  mutate(km3 = km3) %>%
  group_by(km3) %>%
  pivot_longer(cols = 1:9, names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = km3, y = value, group = km3)) +
  geom_boxplot() +
  facet_wrap(variable~.) + xlab("K-means")

# HC
h3 = BOT_clu %>% 
  mutate(hc3 = cutree(hc_average, 3)) %>%
  group_by(hc3) %>%
  pivot_longer(cols = 1:9, names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = hc3, y = value, group = hc3)) +
  geom_boxplot() +
  facet_wrap(variable~.) + xlab("Hierarchical")

# DBSCAN
d3 = BOT_clu %>% 
  mutate(dbs3 = dbs$cluster+1) %>%
  group_by(dbs3) %>%
  pivot_longer(cols = 1:9, names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = dbs3, y = value, group = dbs3)) +
  geom_boxplot() +
  facet_wrap(variable~.) + xlab("DBSCAN")

Sample size for each cluster:

cluster                   1    2    3
k-means                 177  187  187
HC (average linkage)    547    3    1
DBSCAN                  545    6    0
ggarrange(k3, h3, d3, nrow = 1)

  • In K-means clustering, the distributions of BOT_Procrastination and BOT_SpeakUp differ across the three clusters.

6. Conclusion

Next Steps…

  1. Is it appropriate to use the Euclidean distance to measure the similarity and dissimilarity of the discrete data? Boriah et al. (2008) suggest that the Lin, OF, and Goodall3 measures are robust to outliers when measuring distance for categorical data, so it might be worth trying those measures on our dataset and seeing how the resulting clusters perform. (R package: https://cran.r-project.org/web/packages/nomclust/nomclust.pdf)

Attempt: Based on the numerical and visual assessment, the results did not differ much across the different similarity measures. Hence, we stick with the Euclidean distance.

  2. Figure out a way to see how the different clusters relate to outcomes.

Questions

  1. What is the outcome of interest?

  2. Any suggestion on choosing the clustering method? What aspect of the students’ behavior should I look for?

Cait: The question we face in Tandem is: Are there any data that we could be informed by at the beginning of class that should affect how we believe a student will perform in the course, particularly in regards to how they’ll behave in a group setting? And then, since we can’t quite speak to every student individually, could we speak to them in patterns? And those patterns don’t have to be completely uniform. So, if we wanted to send an email to all students after they’ve started the group project, and we wanted to give them advice on how to talk to their group members about hard topics, we will do better if we’re speaking to 5-8 groups rather than 200 students. But, as you’ve seen from the threshold data, I might belong to more than one group!

7. Update: May 4, 2020

7.1 K-means with K = 4

Since the research goal is to divide students into groups so that we can send them tailored messages, we prefer to have a roughly equal number of students per group and thus use the k-means result (k = 4) to proceed with the analysis.

We look at the clustering result from each variable and as a whole first.

BOT$cluster = km4$cluster
BOT %>% 
  select(c(5:10, 18:21)) %>%
  pivot_longer(cols = 1:9) %>%
  ggplot(aes(x = cluster, y = value, group = cluster)) +
  geom_boxplot() +
  facet_wrap(name~.)
**Figure 4.** *Distributions of each variable by cluster.*


The variables, except BOT_PastDiverse, BOT_PastPositive, and BOT_PastWorkDifferent, generally have different distributions among four clusters.

I also used principal component analysis (PCA) to visualize the clustering result. Note that the first three principal components explain around 60% of the variance.

library(rgl)  # provides plot3d()
pc = principal(BOT_clu, 3, rotate = 'varimax')
scores = as.data.frame(pc$scores)
plot3d(scores[,1:3], col = BOT$cluster)
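The cumulative share of explained variance can be read off directly. The sketch below uses base R's `prcomp()` on simulated stand-in data (the analysis itself uses `psych::principal`, but the variance bookkeeping works the same way):

```r
# Reading off cumulative explained variance with base R's prcomp(),
# on simulated stand-in data with the survey's dimensions (551 x 9).
set.seed(123)
sim = matrix(rnorm(551 * 9), ncol = 9)
sim[, 2] = sim[, 1] + rnorm(551, sd = 0.5)  # induce some correlation
pca = prcomp(sim, scale. = TRUE)
cum_var = summary(pca)$importance["Cumulative Proportion", ]
cum_var[3]  # share of total variance captured by the first three PCs
```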


7.2 Interpretation

Next, we aim to see whether these four clusters have different characteristics. Since the earlier exploratory analysis shows correlations between variables, I assume there is a set of underlying factors that explains the interrelationships among them. Identifying those factors should make it easier to characterize the students in different groups.

Hence, I used factor analysis to identify those factors, choosing the number of factors from the scree plot. The values on the arrows below are the standardized loadings for each factor.

#VSS.scree(BOT_clu)
paf1 = fa(BOT_clu, nfactors=4, rotate="varimax", SMC=T, symmetric=T, fm="pa")
fa.diagram(paf1)

By comparing the four factors to the above boxplots for each cluster, the interpretations of the four clusters are as follows:

  • Group 1 (n = 131, proportion = 24%): Students are generally more willing to speak up in groups and expect to fit in well in class. They are also doers who are less likely to procrastinate.

  • Group 2 (n = 176, proportion = 32%): This group of students tends to be the opposite of group 1. They listen more and are less willing to speak up. They are also more likely to feel out of place in class.

  • Group 3 (n = 119, proportion = 22%): They are the ones who consider sharing work to be good and hope the class is challenging, even if they might not get a high grade.

  • Group 4 (n = 125, proportion = 23%): They are procrastinators who value grades over learning. On average, they are extroverted and have experienced work styles similar to their teammates’.

I also looked at how these four groups of students perform at the end of the term.

# Factor analysis
#EOT_clu = EOT[, c(2:14,18:26)]
#paf2 = fa(EOT_clu, nfactors=5, rotate="varimax", SMC=T, symmetric=T, fm="pa")
#fa.diagram(paf2)

df = inner_join(EOT, BOT[,c("user_id", "cluster")], by = "user_id")

df %>% select(1, 2:10, 18:27) %>%
  pivot_longer(cols = 2:19) %>%
  ggplot(aes(x = cluster, y = value, group = cluster)) +
  geom_boxplot() +
  facet_wrap(name~.)

df %>% select(1, 11:14, 27) %>%
  pivot_longer(cols = 2:5) %>%
  ggplot(aes(x = cluster, y = value, group = cluster)) +
  geom_boxplot() +
  facet_wrap(name~.)

  • Generally, groups 1 and 4 consider themselves to be doing more than their fair share of work, and students in group 4 view themselves as more “valuable”.

  • Students in group 2 are rated as good listeners, whether evaluated by themselves or by their peers.

References

[1] Selecting the number of clusters with silhouette analysis on KMeans clustering. scikit-learn documentation. https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html

[2] Zhu, J. (2020). Cluster Analysis. Lecture notes, STATS 503, University of Michigan.

[3] Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of KDD-96.

[4] Boriah, S., Chandola, V., & Kumar, V. (2008). Similarity measures for categorical data: A comparative evaluation. In Proceedings of the 2008 SIAM International Conference on Data Mining.